
Conversation

@rhatdan (Member) commented May 15, 2025

Add a new option, --api, which allows users to specify the API server: either llama-stack or none. With none, we just generate a service with the serve command. With --api llama-stack, RamaLama will generate an API server listening on port 8321 and an OpenAI server listening on port 8080.

Summary by Sourcery

Add support for a unified API layer via a new --api option; implement llama-stack mode via a Stack class that generates and deploys a Kubernetes pod stack; refactor engine command helpers and label handling; update the compute_serving_port and model_factory helpers; and refresh documentation and tests accordingly.

New Features:

  • Add --api option with choices llama-stack or none to unify API layer handling
  • Implement llama-stack mode to generate and deploy a multi-container Kubernetes stack for API serving

Enhancements:

  • Refactor engine label handling into a generic add_labels helper and simplify container manager commands (inspect, stop_container, container_connection)
  • Refactor compute_serving_port to accept args, respect the api option, and display appropriate REST API endpoints
  • Introduce New and Serve helpers in model_factory for consistent model instantiation

Documentation:

  • Document the new api option in ramalama.conf and update CLI manpages for run and serve commands

Tests:

  • Add tests for compute_serving_port behavior with args and api settings and remove outdated stop_container test
  • Update config loading tests to include default and overridden api values

sourcery-ai bot (Contributor) commented May 15, 2025

Reviewer's Guide

This PR adds optional llama-stack API support via a new --api flag that configures unified serving endpoints; implements a Stack generator for multi-container YAML deployments; refactors the engine module for consolidated labeling, port handling, and container utilities; centralizes model instantiation and serve logic in model_factory; and updates documentation and tests to reflect these changes.

Sequence diagram for serve command with llama-stack

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant Stack
    participant Podman

    User->>CLI: ramalama serve <MODEL> --api llama-stack --port <host_port>
    CLI->>CLI: Parse arguments
    CLI->>CLI: args.port = compute_serving_port(args)
    CLI->>Stack: new Stack(args)
    Stack-->>CLI: stack instance
    CLI->>Stack: serve()
    Stack->>Stack: generate() YAML for Podman Kube
    Stack->>Podman: podman kube play --replace <generated_yaml_file>
    Podman-->>Stack: Deployment status
    Stack-->>CLI: Status
    CLI->>User: Output (Llama Stack & OpenAI API URLs)
```

File-Level Changes

Change: Add unified API layer (--api llama-stack) with configurable serving ports and behavior
  • Add --api CLI option to run/serve commands
  • Set default api in configuration and conf files
  • Extend compute_serving_port to handle Llama Stack vs OpenAI endpoints
  • Validate container usage for llama-stack mode in serve path
  Files: ramalama/cli.py, ramalama/config.py, ramalama/model.py, docs/ramalama.conf, docs/ramalama.conf.5.md, docs/ramalama-run.1.md, docs/ramalama-serve.1.md, test/unit/test_model.py, test/unit/test_config.py

Change: Introduce Stack class for multi-container llama-stack deployments
  • Implement Stack to generate and apply multi-container YAML
  • Wire stack.serve() path into CLI when --api llama-stack
  Files: ramalama/stack.py, ramalama/cli.py

Change: Refactor engine to consolidate labeling, port mapping, and container utilities
  • Replace add_container_labels with generic add_labels helper
  • Add add_name, cap_add, and improved add_port_option logic
  • Extract inspect and container_connection helpers and revise stop_container
  Files: ramalama/engine.py

Change: Centralize model instantiation and serve logic in model_factory
  • Move New and Serve helpers into model_factory.py
  • Adjust model.py to call engine.add before command args
  • Unify serve invocation via new factory helpers
  Files: ramalama/model_factory.py, ramalama/model.py, ramalama/cli.py

Change: Update tests to reflect new API and refactored functions
  • Adapt compute_serving_port tests for updated signature
  • Remove stale test_stop_container in engine tests
  • Add coverage for config api default and CLI choices
  Files: test/unit/test_engine.py, test/unit/test_model.py, test/unit/test_config.py


@sourcery-ai sourcery-ai bot left a comment

Hey @rhatdan - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Invalid raise () syntax (link)

  • Undefined self.engine in Stack.Serve (link)

  • Replace the bare raise () in serve_cli with a proper exception type and descriptive message when --container is missing.

  • The new container_connection function in engine.py isn’t referenced anywhere—consider integrating it where needed or removing it to avoid dead code.

  • Manually concatenating YAML strings in Stack.generate is brittle; consider using a YAML library or templating for more robust and maintainable serialization.

Here's what I looked at during the review
  • 🔴 General issues: 2 blocking issues, 7 other issues
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


raise ValueError("no container manager (Podman, Docker) found")

conman_args = [conman, "port", name, port]
output = run_cmd(conman_args, debug=args.debug).stdout.decode("utf-8").strip()

issue: Parsing docker port output by splitting on '>' is brittle

This breaks if Docker/Podman returns multiple mappings or uses a different format. Use a regex or docker port --format structured output for more reliable parsing.
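
A rough sketch of the structured approach, assuming the conventional `8080/tcp -> 0.0.0.0:8080` output format of `podman port`/`docker port`; the helper name is illustrative and not ramalama's actual code:

```python
import re
import subprocess


def parse_port_mappings(conman: str, name: str, port: str) -> list[tuple[str, str]]:
    """Return (host, host_port) tuples for a container port, tolerating multiple mappings."""
    output = subprocess.run(
        [conman, "port", name, port], capture_output=True, text=True, check=True
    ).stdout
    mappings = []
    for line in output.splitlines():
        # Matches "8080/tcp -> 0.0.0.0:8080" and IPv6 forms such as "8080/tcp -> [::]:8080".
        match = re.search(r"->\s*(\S+):(\d+)\s*$", line)
        if match:
            mappings.append((match.group(1), match.group(2)))
    return mappings
```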

def add_port_option(self):
if hasattr(self.args, "port"):
self.exec_args += ["-p", f"{self.args.port}:{self.args.port}"]
if hasattr(self.args, "port") and self.args.port != "":

issue (bug_risk): Port handling may error if args.port is non-string or None

Avoid .count() on None or ints by casting args.port to str or checking truthiness (e.g., if self.args.port:) first.
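
A minimal sketch of that guard; the attribute and method names mirror the snippet above, but the surrounding class is assumed:

```python
def add_port_option(self):
    # args.port may be absent, None, an int, or an empty string; normalize it
    # before any string operations such as .count(":") are applied.
    port = getattr(self.args, "port", None)
    if not port:
        return
    port = str(port)
    self.exec_args += ["-p", f"{port}:{port}"]
```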

ramalama/cli.py Outdated

if args.api == "llama-stack":
if not args.container:
raise ()

issue (bug_risk): Invalid raise () syntax

Replace raise () with a specific exception, e.g.:

raise ValueError('args.container is required')

volumes:{volumes}"""
return self.stack_yaml

def Serve(self):

issue (bug_risk): Undefined self.engine in Stack.Serve

self.engine isn't initialized, so calling self.engine.run() will raise an AttributeError. Inject or instantiate an engine, or remove this call.
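
One way to address this is to inject the engine when the Stack is constructed; the EngineLike protocol below is illustrative, not ramalama's actual Engine API:

```python
from typing import Protocol


class EngineLike(Protocol):
    def run(self, cmd: list[str]) -> None: ...


class Stack:
    def __init__(self, args, engine: EngineLike):
        self.args = args
        # Storing the dependency here avoids the AttributeError that Serve()
        # would otherwise raise by referencing an attribute that was never set.
        self.engine = engine

    def Serve(self):
        self.engine.run(["kube", "play", "--replace", "stack.yaml"])
```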

self.model_port = str(int(self.args.port) + 1)
self.stack_image = tagged_image("quay.io/ramalama/llama-stack")

def generate(self):

issue: Generated YAML retains unintended indentation

Use textwrap.dedent or construct the YAML string without indentation to avoid extra leading spaces from the triple-quoted string.
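
A small illustration of the textwrap.dedent approach; the YAML here is a placeholder rather than the actual stack manifest:

```python
import textwrap


def generate(name: str, image: str) -> str:
    # dedent() removes the common leading whitespace that the triple-quoted
    # string inherits from the surrounding method body, so the emitted YAML
    # starts at column zero while keeping its own nesting.
    return textwrap.dedent(f"""\
        apiVersion: v1
        kind: Pod
        metadata:
          name: {name}
        spec:
          containers:
          - name: {name}
            image: {image}
        """)
```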

seLinuxOptions:
type: spc_t"""

self.stack_yaml = f"""

issue (complexity): Consider building the YAML manifest using Python dicts and yaml.safe_dump() instead of string concatenation for clarity and maintainability.

Consider replacing all the raw‐string/YAML concatenation with a Python dict + `yaml.safe_dump()`.  This keeps your intent explicit, eliminates all the nesting/quotes, and is much easier to maintain.

Example:

```python
import yaml
from ramalama.common import tagged_image
from ramalama.model_factory import ModelFactory, New

class Stack:
    def __init__(self, args):
        self.args = args
        self.host = "127.0.0.1"
        mf = ModelFactory(args.MODEL, args)
        self.model = mf.prune_model_input()
        nw = New(args.MODEL, args)
        self.model_type = nw.type
        self.model_path = nw.get_model_path(args)
        self.model_port = str(int(args.port) + 1)
        self.stack_image = tagged_image("quay.io/ramalama/llama-stack")

    def generate(self):
        # build lists/dicts instead of strings
        volume_mounts = [
            {'mountPath': '/mnt/models/model.file', 'name': 'model'},
            {'mountPath': '/dev/dri',             'name': 'dri'},
        ]
        if self.model_type == "OCI":
            volume_mounts[0] = {
                'mountPath': '/mnt/models',
                'subPath':  '/models',
                'name':     'model'
            }

        volumes = [
            {'hostPath': {'path': self.model_path}, 'name': 'model'},
            {'hostPath': {'path': '/dev/dri'},      'name': 'dri'},
        ]

        llama_cmd = [
            "llama-server", "--port", self.model_port,
            "--model", "/mnt/models/model.file",
            "--alias", self.model, "--ctx-size", self.args.context,
            "--temp", self.args.temp, "--jinja", "--cache-reuse", "256",
            "-v", "--threads", self.args.threads,
            "--host", self.host,
        ]

        security_ctx = {
            'allowPrivilegeEscalation': False,
            'capabilities': {
                'drop': [
                    "CAP_CHOWN", "CAP_FOWNER", "CAP_FSETID",
                    "CAP_KILL", "CAP_NET_BIND_SERVICE",
                    "CAP_SETFCAP", "CAP_SETGID",
                    "CAP_SETPCAP", "CAP_SETUID",
                    "CAP_SYS_CHROOT"
                ],
                'add': ["CAP_DAC_OVERRIDE"]
            },
            'seLinuxOptions': {'type': 'spc_t'}
        }

        deployment = {
            'apiVersion': 'v1',
            'kind':       'Deployment',
            'metadata': {
                'name':   self.args.name,
                'labels': {'app': self.args.name}
            },
            'spec': {
                'replicas': 1,
                'selector': {'matchLabels': {'app': self.args.name}},
                'template': {
                    'metadata': {'labels': {'app': self.args.name}},
                    'spec': {
                        'containers': [
                            {
                                'name':             'model-server',
                                'image':            self.args.image,
                                'command':          ["/usr/libexec/ramalama/ramalama-serve-core"],
                                'args':             llama_cmd,
                                'securityContext':  security_ctx,
                                'volumeMounts':     volume_mounts,
                            },
                            {
                                'name': 'llama-stack',
                                'image': self.stack_image,
                                'args':  ['/bin/sh', '-c', 'llama stack run --image-type venv /etc/ramalama/ramalama-run.yaml'],
                                'env': [
                                    {'name': 'RAMALAMA_URL',     'value': f"http://127.0.0.1:{self.model_port}"},
                                    {'name': 'INFERENCE_MODEL',  'value': self.model},
                                ],
                                'securityContext': security_ctx,
                                'ports': [{'containerPort': 8321, 'hostPort': self.args.port}],
                            }
                        ],
                        'volumes': volumes
                    }
                }
            }
        }

        return yaml.safe_dump(deployment, sort_keys=False)
```

Benefits:

  • No more fragile indentation/f‐strings
  • Clear data model of your Kubernetes manifest
  • Easy to tweak/add/remove fields
  • Full fidelity: yaml.safe_dump() produces valid YAML under the hood

Comment on lines +143 to +168
except Exception:
raise e

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
except Exception:
raise e
except Exception as exc:
raise e from exc

@rhatdan rhatdan force-pushed the llama-stack branch 5 times, most recently from cd15235 to 609b47e Compare May 16, 2025 14:24
@rhatdan rhatdan changed the title [WIP] Add support for llama-stack Add support for llama-stack May 28, 2025
@rhatdan rhatdan force-pushed the llama-stack branch 4 times, most recently from e635313 to 517a7c9 Compare May 28, 2025 19:10
@rhatdan (Member, Author) commented May 28, 2025

@sourcery-ai Please review my PR

sourcery-ai bot (Contributor) commented May 28, 2025

Sure! I'm generating a new review now.

@sourcery-ai sourcery-ai bot left a comment

Hey @rhatdan - I've reviewed your changes and they look great!

Blocking issues:

  • file object opened without corresponding close (link)
  • Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'. (link)
  • Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'. (link)
  • file object opened without corresponding close (link)
  • Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'. (link)
  • Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'. (link)
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🔴 Security: 6 blocking issues
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


Comment on lines +38 to +36
cleanlabel = label.replace("=", ": ", 1)
self.labels = f"{self.labels}\n {cleanlabel}"

issue (bug_risk): Label formatting replaces '=' with ': ', which may not be YAML-safe for all label values.

If label values include special characters or newlines, the output may be invalid YAML. Consider using a YAML library or properly escaping values.
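
A sketch of the library-based approach, collecting labels as a dict and letting the YAML emitter handle quoting; the function name is illustrative and assumes `key=value` label strings:

```python
import yaml


def render_labels(labels: list[str]) -> str:
    # Split each "key=value" pair once and let yaml.safe_dump escape special
    # characters and newlines in the values instead of hand-building the text.
    label_map = dict(label.split("=", 1) for label in labels)
    return yaml.safe_dump({"labels": label_map}, default_flow_style=False)


# Values containing ':' or newlines are emitted as valid, properly quoted YAML.
print(render_labels(["app=ramalama", "note=contains: colon\nand a newline"]))
```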

if self.args.dryrun:
print(yaml)
return
yaml_file = tempfile.NamedTemporaryFile(prefix='RamaLama_', delete=not self.args.debug)

security (open-never-closed): file object opened without corresponding close

Source: opengrep

print(yaml)
return
yaml_file = tempfile.NamedTemporaryFile(prefix='RamaLama_', delete=not self.args.debug)
with open(yaml_file.name, 'w') as c:

security (tempfile-without-flush): Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'.

Source: opengrep
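
A minimal sketch of one way to satisfy both findings (write, flush, hand the path to podman, then close); the debug-driven delete flag mirrors the snippet above, but the exact ramalama flow is assumed:

```python
import subprocess
import tempfile


def kube_play(yaml_text: str, debug: bool = False) -> None:
    # Keep the file around when debugging so the generated YAML can be inspected.
    yaml_file = tempfile.NamedTemporaryFile(
        mode="w", prefix="RamaLama_", suffix=".yaml", delete=not debug
    )
    try:
        yaml_file.write(yaml_text)
        # Flush before podman reads the path, otherwise the file may still be empty.
        yaml_file.flush()
        subprocess.run(["podman", "kube", "play", "--replace", yaml_file.name], check=True)
    finally:
        # Closing the handle also removes the file unless delete=False was chosen.
        yaml_file.close()
```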

"kube",
"play",
"--replace",
yaml_file.name,

security (tempfile-without-flush): Using 'yaml_file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'yaml_file.name' is used. Use '.flush()' or close the file before using 'yaml_file.name'.

Source: opengrep

exec_cmd(exec_args, debug=self.args.debug)

def stop(self):
yaml_file = tempfile.NamedTemporaryFile(prefix='RamaLama_', delete=not self.args.debug)

security (open-never-closed): file object opened without corresponding close

Source: opengrep

Comment on lines 90 to 101
data = {
"stream": True,
"messages": self.conversation_history,
"model": self.models[0],
}

issue (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Comment on lines 321 to 322
value = getattr(args, arg)
if value:

suggestion (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)

Suggested change
value = getattr(args, arg)
if value:
if value := getattr(args, arg):


# otherwise compute a random serving port in the range
target_port = get_available_port_if_any(debug)
if args.port != "" and args.port != str(DEFAULT_PORT):

suggestion (code-quality): Replace multiple comparisons of same variable with in operator (merge-comparisons)

Suggested change
if args.port != "" and args.port != str(DEFAULT_PORT):
if args.port not in ["", str(DEFAULT_PORT)]:

Comment on lines +167 to +168
except Exception:
raise e

suggestion (code-quality): Explicitly raise from a previous error (raise-from-previous-error)

Suggested change
except Exception:
raise e
except Exception as exc:
raise e from exc

Comment on lines 20 to 24
if hasattr(args, "name") and args.name:
self.name = args.name
else:
self.name = genname()


suggestion (code-quality): Replace if statement with if expression (assign-if-exp)

Suggested change
if hasattr(args, "name") and args.name:
self.name = args.name
else:
self.name = genname()
self.name = args.name if hasattr(args, "name") and args.name else genname()

sourcery-ai bot (Contributor) commented May 28, 2025

Hey @rhatdan, I've posted a new review for you!

@rhatdan rhatdan force-pushed the llama-stack branch 5 times, most recently from f0a4c73 to 62610b7 Compare May 29, 2025 13:15
@rhatdan rhatdan force-pushed the llama-stack branch 4 times, most recently from 33c8488 to 48fa9ed Compare May 29, 2025 17:34
@bmahabirbu (Collaborator) commented:

Looks good so far! It seems a test is failing with a "no connection" error; I wonder whether that has to do with the endpoints, or whether the test itself needs fixing.

@ericcurtin ericcurtin requested a review from Copilot May 30, 2025 05:13
Copilot AI left a comment

Pull Request Overview

Adds a --api flag to enable a unified API layer, implements a new llama-stack mode with Kubernetes pod generation, and refactors port computation and label handling.

  • Introduce --api option and default in configuration
  • Implement Stack class to generate/deploy llama-stack Kubernetes YAML
  • Refactor compute_serving_port to accept args and print REST API endpoints

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Summary per file:

  • test/unit/test_model.py: Updated tests for new compute_serving_port(args, …) signature
  • test/unit/test_engine.py: Removed outdated stop_container import and test
  • test/unit/test_config.py: Added default and overridden api value to config tests
  • test/system/040-serve.bats: Removed assertions on old port output in serve tests
  • ramalama/stack.py: New Stack class to support llama-stack API mode
  • ramalama/shortnames.py: Flush file after writing shortnames
  • ramalama/rag.py: Flush file after writing containerfile in RAG build
  • ramalama/oci.py: Flush file after writing containerfile in OCI build
  • ramalama/model_factory.py: Add New and Serve helpers for model instantiation
  • ramalama/model.py: Refactor compute_serving_port to use args and handle api
  • ramalama/engine.py: Generalize label handling and add container helpers
  • ramalama/config.py: Set default api to none
  • ramalama/cli.py: Register --api option and route llama-stack serve
  • libexec/ramalama/ramalama-client-core: Revamped interactive shell for API-driven client
  • docs/ramalama.conf.5.md: Document api setting in manpage
  • docs/ramalama.conf: Document api setting in sample config
  • docs/ramalama-serve.1.md: Document --api in serve CLI manpage
  • docs/ramalama-run.1.md: Document --api in run CLI manpage

Comments suppressed due to low confidence (2)

test/unit/test_model.py:118

  • The mock_compute_ports is defined but never applied to compute_serving_port; you may need to patch get_available_port_if_any or remove the unused mock.
mock_compute_ports = Mock(return_value=expectedRandomizedResult)

test/system/040-serve.bats:13

  • Existing tests for port output were removed but should be updated to assert the new REST API endpoint messages printed by compute_serving_port.
assert "$output" =~ "serving on port .*"

return res(response, self.parsed_args.color)

print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
self.kills(self.parsed_args)
Copilot AI commented May 30, 2025

Method kills is defined without parameters but is called with self.parsed_args, leading to a TypeError. Call self.kills() instead.

Suggested change
self.kills(self.parsed_args)
self.kills()

seLinuxOptions:
type: spc_t"""

self.stack_yaml = f"""
Copilot AI commented May 30, 2025

[nitpick] Building large YAML via f-strings with embedded indentation is error-prone. Consider using a YAML templating library or a structured builder for readability and reliability.

@rhatdan (Member, Author) replied:

@copilot could you rewrite this section to use a YAML templating library?

@rhatdan rhatdan force-pushed the llama-stack branch 2 times, most recently from 103ac92 to fd311d1 Compare May 30, 2025 10:01
@rhatdan (Member, Author) commented May 30, 2025

@ericcurtin @bmahabirbu @nathan-weinberg @cdoern This is ready to go in. PTAL

@rhatdan rhatdan force-pushed the llama-stack branch 3 times, most recently from 2ee5ae9 to b3c822f Compare May 30, 2025 12:58
Add new option --api which allows users to specify the API Server
either llama-stack or none.  With None, we just generate a service with
serve command. With `--api llama-stack`, RamaLama will generate an API
Server listening on port 8321 and a openai server listening on port
8080.

Signed-off-by: Daniel J Walsh <[email protected]>
@nathan-weinberg nathan-weinberg self-requested a review May 30, 2025 13:27
@ericcurtin ericcurtin merged commit ee7cb50 into containers:main May 31, 2025
15 checks passed